YOCO is a decoder-decoder architecture that retains global attention while reducing GPU memory demands. A self-decoder first encodes the sequence and produces a single global key-value cache; a cross-decoder then attends to that cache via cross-attention, so key-value pairs are cached once and reused across layers rather than stored per layer. YOCO achieves favorable performance compared to traditional Transformers, with significant improvements in inference memory, latency, and throughput, making it well suited to large language models and long context lengths.
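Below is a minimal PyTorch sketch of this decoder-decoder idea, not the paper's implementation: the module names, layer counts, and dimensions are illustrative, and the self-decoder here uses plain causal self-attention in place of YOCO's efficient attention variants. The point it shows is the single shared cache: the self-decoder's output is stored once, and every cross-decoder layer reuses it.

```python
import torch
import torch.nn as nn


class SelfDecoderLayer(nn.Module):
    """Causal self-attention block (stands in for YOCO's efficient self-decoder)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, mask):
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + h
        return x + self.ff(self.norm2(x))


class CrossDecoderLayer(nn.Module):
    """Cross-attention block: queries come from the running stream,
    keys/values come from the single cache produced by the self-decoder."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, kv_cache, mask):
        h, _ = self.attn(self.norm1(x), kv_cache, kv_cache,
                         attn_mask=mask, need_weights=False)
        x = x + h
        return x + self.ff(self.norm2(x))


class YOCOSketch(nn.Module):
    """Decoder-decoder: the self-decoder runs first and its output is cached once;
    every cross-decoder layer reuses that cache instead of keeping its own."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_self=2, n_cross=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.self_layers = nn.ModuleList(
            SelfDecoderLayer(d_model, n_heads) for _ in range(n_self))
        self.cross_layers = nn.ModuleList(
            CrossDecoderLayer(d_model, n_heads) for _ in range(n_cross))
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        T = tokens.size(1)
        # Causal mask: True marks positions a query may not attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        for layer in self.self_layers:
            x = layer(x, mask)
        kv_cache = x  # the "cache once" step: one global key-value stream
        for layer in self.cross_layers:
            x = layer(x, kv_cache, mask)
        return self.head(x)


if __name__ == "__main__":
    model = YOCOSketch(vocab_size=1000)
    logits = model(torch.randint(0, 1000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 1000])
```

Because only one key-value stream is kept for the whole cross-decoder stack, cache memory no longer scales with the number of decoder layers, which is where the inference-memory savings in the summary come from.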
Friday, May 10, 2024